Abstract

This paper presents a new context-free parsing algorithm based on a bidirectional,
strictly horizontal strategy which incorporates strong top-down predictions (derivations
and adjacencies). From a functional point of view, the parser is able to propagate
syntactic constraints, reducing parsing ambiguity. From a computational perspective,
the algorithm includes different techniques aimed at improving the manipulation and
representation of the structures used.


1 Parsing Ambiguity and Parsing Efficiency 



In Formal Language Theory [Aho & Ullman 1972, Drobot 1989] a language is a set,
and in Set Theory an element either belongs to a set or does not. That is to say, a set
(and therefore a language) is an unambiguous structure. A grammar may be considered
an intensional definition of a language. Thus, the notion of grammaticality corresponds
to the relation of membership over a language (set). But a grammar incorporates
more information than the simple enumeration of the elements of the language (the
extensional specification). A grammar defines a structure: the parse tree or forest. The
distance between grammaticality and grammatical structure is a first level of ambiguity:
grammatical ambiguity.

The next notion to take into account is the process of analyzing a string of words
with a grammar, that is, the parser [Kay 1980], [Bolc 1987], [Sikkel & Nijholt 1997]. A
parser must be able to determine the relation of grammaticality and to obtain the
grammatical structure by means of a set of operations, which we will call the parsing
structure. The distance between the grammatical structure and the parsing structure
defines a second level of ambiguity: parsing ambiguity, usually referred to as temporal
ambiguity.

Parsing ambiguity depends on two factors: the grammar and the parsing strategy.
A very important design requirement of natural language parsers is to eliminate parsing
ambiguity, that is, to reduce the work done by the parser to the amount of grammatical
structure allowed by the grammar. The work presented here is a further step in this
direction [Earley 1970], [Kay 1980], [Tomita 1987], [Tomita 1991], [Dowding et al. 1994].

The second goal of this paper is to present a computational model aimed at improving
the efficiency of the algorithm [Carroll 1994], [Quesada & Amores Forthcoming].
In this sense, our proposal may be understood as the incorporation of strong top-down
predictions (partial derivations and adjacencies) into a bottom-up framework.

The two strategies (bottom-up and top-down) are combined by a mechanism
able to propagate syntactic constraints over a bidirectional model based on a strictly
horizontal strategy [Quesada 1996].

Section 2 presents an informal introduction to the problem of parsing ambiguity
with chart parsing [Kay 1980], but similar situations may be described for other
strategies such as Earley's algorithm [Earley 1970], DCG [Pereira & Warren 1980], GLR
[Chapman 1987], [Tomita 1991], etc. Section 3 formally defines the relations that support
the mechanism of bottom-up bidirectional analysis, top-down predictions and
constraint propagation. Section 4 presents the parsing algorithm in detail and, finally,
Section 5 shows some experimental results.



2 An Informal Introduction 

Let us consider the following grammar: 

S -> A1 b
S -> A2 c
A1 -> a
A1 -> a A1
A2 -> a
A2 -> a A2

and the string of words a a a b. Figure 1 shows the arcs generated by a bidirectional
chart parser in a first stage in which we have created only the arcs with at least one
pre-terminal symbol. Each arc is identified by a number and indicates the symbol
that the arc will generate and, for active arcs only, the expected symbol.



Figure 1



Let us consider what happens at position 3. There exists an obvious relation
between arcs 13 and 9, but arcs 10, 11 and 12 have no corresponding link at this
position. Figure 2 shows the parsing state once we have deleted these three arcs.



Figure 2



At position 2 there exists a relation between arcs 7 and 9, and arcs 5, 6 and 8 may
be deleted. Once we have deleted these three arcs, if we analyze position 1 we can
now delete arcs 1, 2 and 4, obtaining Figure 3.



Figure 3



Therefore, our next goal will be to define formally the relations between arcs that 
guarantee their success during parsing. 



3 The Mathematical Kernel 

3.1 Bottom-Up Derivation 

Given a context-free grammar $G = \langle G_T, G_N, G_P, G_R \rangle$, where we have distinguished
its set of productions $G_P$, roots $G_R$, terminal symbols $G_T$, non-terminal symbols $G_N$
and vocabulary $G_V = G_T \cup G_N$, we define the bottom-up derivation as follows.
Let $\delta \in G_V$ and $\Delta, \Gamma, \Omega \in G_V^{*}$. The direct bottom-up derivation in $G$, $\longrightarrow_G$, is
defined as:

$\Gamma \Delta \Omega \longrightarrow_G \Gamma \delta \Omega$ iff $\delta \to \Delta \in G_P$

The bottom-up derivation in $G$, $\Longrightarrow_G$, is defined as the reflexive and transitive
closure of the direct bottom-up derivation:

$\Gamma \Longrightarrow_G \Omega$ iff $\exists \Delta_1, \ldots, \Delta_n \in G_V^{*}$ such that $\forall i\,(1 \le i < n)\ \Delta_i \longrightarrow_G \Delta_{i+1}$,
where $\Delta_1 = \Gamma$ and $\Delta_n = \Omega$
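
As a worked illustration, and assuming the grammar of Section 2 (with $A_1$ standing for its first auxiliary non-terminal), the string a a a b is reduced to the root by four direct bottom-up derivation steps. The production applied at each step is shown on the right.

```latex
\begin{align*}
a\,a\,a\,b &\longrightarrow_G a\,a\,A_1\,b && (A_1 \to a) \\
           &\longrightarrow_G a\,A_1\,b    && (A_1 \to a\,A_1) \\
           &\longrightarrow_G A_1\,b       && (A_1 \to a\,A_1) \\
           &\longrightarrow_G S            && (S \to A_1\,b)
\end{align*}
```

Hence $a\,a\,a\,b \Longrightarrow_G S$ by the closure defined above.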

3.2 Partial Derivability and Adjacency 

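Root Symbols. $\alpha$ is a root symbol: $R(\alpha)$ iff $\alpha \in G_R$

Epsilon Symbols. $\alpha$ is an epsilon symbol: $E(\alpha)$ iff $\varepsilon \Longrightarrow_G \alpha$ ^1
1 $\varepsilon$ is the empty string.

String of Epsilon Symbols. $\Delta$ is a string of epsilon symbols: $E(\Delta)$ iff $\forall \delta \in \Delta\ (E(\delta))$

Left Partial Derivability. $\beta$ is a left partial derivation of $\alpha$:

$\alpha \mapsto_l^{*} \beta$ iff $\exists \Gamma, \Delta, \Omega \in G_V^{*}$ such that $(\Gamma \alpha \Delta \Longrightarrow_G \Gamma \beta \Omega)$.
We define $LPD(\alpha) = \{\beta \in G_V : \alpha \mapsto_l^{*} \beta\} \cup \{\alpha\}$ ^2

Right Partial Derivability. $\beta$ is a right partial derivation of $\alpha$:

$\alpha \mapsto_r^{*} \beta$ iff $\exists \Gamma, \Delta, \Omega \in G_V^{*}$ such that $(\Gamma \alpha \Delta \Longrightarrow_G \Omega \beta \Delta)$.
We define $RPD(\alpha) = \{\beta \in G_V : \alpha \mapsto_r^{*} \beta\} \cup \{\alpha\}$ ^3

Primary Adjacency. $\beta$ is a primary adjacent of $\alpha$:

$\alpha \Uparrow \beta$ iff $\exists \delta \in G_V$ and $\exists \Gamma, \Omega, \Delta \in G_V^{*}$ such that $(\delta \to \Gamma \alpha \Delta \beta \Omega \in G_P \wedge E(\Delta))$

Left Adjacency. $\beta$ is a left adjacent of $\alpha$:

$\alpha \Uparrow_l^{*} \beta$ iff $\exists \gamma \in LPD(\alpha)$ and $\exists \delta \in RPD(\beta)$ such that $(\delta \Uparrow \gamma)$.
We define $LA(\alpha) = \{\beta \in G_V : \alpha \Uparrow_l^{*} \beta\}$.

Right Adjacency. $\beta$ is a right adjacent of $\alpha$:

$\alpha \Uparrow_r^{*} \beta$ iff $\exists \gamma \in RPD(\alpha)$ and $\exists \delta \in LPD(\beta)$ such that $(\gamma \Uparrow \delta)$.
We define $RA(\alpha) = \{\beta \in G_V : \alpha \Uparrow_r^{*} \beta\}$.

Left-Most Symbol. $\alpha$ is a left-most symbol:

$LM(\alpha)$ iff $\exists \delta \in G_V$ such that $(\alpha \mapsto_l^{*} \delta \wedge R(\delta))$

Right-Most Symbol. $\alpha$ is a right-most symbol:

$RM(\alpha)$ iff $\exists \delta \in G_V$ such that $(\alpha \mapsto_r^{*} \delta \wedge R(\delta))$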
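
The partial derivability sets can be precomputed from the grammar. The following is a minimal sketch, not the implementation described in this paper: it assumes a grammar without empty productions, in which case LPD and RPD reduce to the reflexive and transitive closure of the "first symbol of a right-hand side" and "last symbol of a right-hand side" relations. The symbol names, the `Prod` type and the function `partial_derivations` are illustrative assumptions, and the grammar encoded is the one from Section 2.

```c
#include <stdbool.h>
#include <string.h>

/* Symbols of the Section 2 grammar (names are illustrative). */
enum { SYM_S, SYM_A1, SYM_A2, SYM_a, SYM_b, SYM_c, NSYM };

typedef struct { int lhs; int len; int rhs[2]; } Prod;

static const Prod GP[] = {
    { SYM_S,  2, { SYM_A1, SYM_b } }, { SYM_S,  2, { SYM_A2, SYM_c } },
    { SYM_A1, 1, { SYM_a } },         { SYM_A1, 2, { SYM_a, SYM_A1 } },
    { SYM_A2, 1, { SYM_a } },         { SYM_A2, 2, { SYM_a, SYM_A2 } },
};
enum { NPROD = sizeof GP / sizeof GP[0] };

/* Fills pd so that pd[x][y] is true iff y is in LPD(x) (left == true)
 * or in RPD(x) (left == false). Assumes no empty productions. */
void partial_derivations(bool pd[NSYM][NSYM], bool left)
{
    memset(pd, 0, sizeof(bool) * NSYM * NSYM);
    for (int x = 0; x < NSYM; x++)
        pd[x][x] = true;              /* a symbol is a derivation of itself */

    for (bool changed = true; changed; ) {
        changed = false;
        for (int p = 0; p < NPROD; p++) {
            /* corner: first (or last) symbol of the right-hand side */
            int corner = GP[p].rhs[left ? 0 : GP[p].len - 1];
            for (int x = 0; x < NSYM; x++) {
                if (pd[x][corner] && !pd[x][GP[p].lhs]) {
                    pd[x][GP[p].lhs] = true;   /* lhs is reachable from x */
                    changed = true;
                }
            }
        }
    }
}
```

With these tables, LM(a) can be read off as "some root symbol is in LPD(a)", which holds here because S is in LPD(a), and RM(b) holds because S is in RPD(b).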

3.3 Coverage Tables 

Finally we present the formal definition of the coverage tables which are in charge of 
triggering the events of the bidirectional parser. 

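For each symbol of a grammar, $\alpha \in G_V$, its left, $LC1(\alpha)$ and $LC2(\alpha)$, middle,
$MC(\alpha)$, and right, $RC(\alpha)$, coverages are defined as sets of productions in the following
way:

$LC1(\alpha) = \{(\delta \to \alpha \in G_P) : \delta \in G_N\}$
$LC2(\alpha) = \{(\delta \to \alpha \Omega \in G_P) : \delta \in G_N \wedge \Omega \in G_V^{*} \wedge \neg E(\Omega)\}$
$MC(\alpha) = \{(\delta \to \Delta \alpha \Omega \in G_P) : \delta \in G_N \wedge \Omega, \Delta \in G_V^{*} \wedge \neg E(\Delta) \wedge \neg E(\Omega)\}$
$RC(\alpha) = \{(\delta \to \Delta \alpha \in G_P) : \delta \in G_N \wedge \Delta \in G_V^{*} \wedge \neg E(\Delta)\}$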

2 We will consider that a symbol is a left partial derivation of itself. 
3 We will consider that a symbol is a right partial derivation of itself. 
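
To make the four coverages concrete, the sketch below classifies one occurrence of a symbol inside a right-hand side. It is an illustration under a simplifying assumption: the grammar has no epsilon symbols, so the condition "not E(X)" just means that the context X is non-empty. The names `Coverage` and `coverage_of` are hypothetical.

```c
#include <assert.h>

typedef enum { COV_LC1, COV_LC2, COV_MC, COV_RC } Coverage;

/* Classifies the occurrence at index pos (0-based) of a right-hand side
 * of length len. Assumes an epsilon-free grammar, so a context is a
 * string of epsilon symbols only when it is empty. */
Coverage coverage_of(int pos, int len)
{
    assert(len >= 1 && pos >= 0 && pos < len);
    if (len == 1)
        return COV_LC1;   /* delta -> alpha             */
    if (pos == 0)
        return COV_LC2;   /* delta -> alpha Omega       */
    if (pos == len - 1)
        return COV_RC;    /* delta -> Delta alpha       */
    return COV_MC;        /* delta -> Delta alpha Omega */
}
```

For the grammar of Section 2, the occurrence of a in A1 -> a falls in LC1(a), the occurrence of a in A1 -> a A1 falls in LC2(a), and the occurrence of b in S -> A1 b falls in RC(b).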



4 The Parsing Algorithm 



4.1 Parsing Input. 

The main task of the lexical analyzer is to split the input string into a set of items,
each one associated with one or more (lexical ambiguity) pre-terminal symbols (syntactic
categories). Our parsing algorithm is also able to deal with "multi-word expressions"
and "multi-expression words" ^4.

In any case, the parsing input will be a list of breaking points and a set of
pre-terminal symbols, each one associated with a lexical unit (a portion of the input string)
and two breaking points. For instance, consider the input string a a a b. This
string will be lexically analyzed, obtaining 5 breaking points and 4 pre-terminal symbols:



[0] a [1] a [2] a [3] b [4]



4.2 Step 1: CaD creation. 

For each breaking point we generate a CaD (collection and diffusion of information)
structure, which has 6 fields: the first four fields are lists of Events and the last two
are lists of Nodes: tole (events arriving at the CaD from the right side), frle
(events going to the left from the CaD), tori (events arriving at the CaD from the
left side), frri (events going to the right from the CaD), ndle (nodes at the left of the
CaD) and ndri (nodes at the right of the CaD).

If the lexical analyzer has obtained n breaking points, then we store the CaD
structures as an array of n pointers to CaD structures. We will call this array
CaD_root.



4.3 Step 2: Node creation. 

For each element <lexical_unit, pre-terminal_symbol, fbp^5, lbp^6> we generate
a Node structure, which has the following fields: grsymbol (grammar symbol) and
cmanalysis (complex analysis, a list of Analysis structures). The new node newNode
will be associated with the corresponding CaD structures:

CaD_root[fbp]->ndri = AddNode(newNode)
CaD_root[lbp]->ndle = AddNode(newNode)
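
Steps 1 and 2 can be sketched in C as follows. This is only one possible layout consistent with the fields named above, not the data model of the actual implementation: the list cell types, `cad_root_new`, `add_node` and `node_new` are assumptions introduced for illustration, and the AddNode operation is rendered here as a helper that takes the list to extend explicitly.

```c
#include <stdlib.h>

typedef struct Analysis Analysis;    /* a list of Node structures (opaque here) */
typedef struct EventCell EventCell;  /* list cell for Events (opaque here) */

typedef struct Node {
    int       grsymbol;    /* grammar symbol */
    Analysis *cmanalysis;  /* complex analysis: list of Analysis structures */
} Node;

/* A node sits in two lists (ndri of fbp and ndle of lbp), so the lists
 * use separate cells instead of a link stored inside the node. */
typedef struct NodeCell {
    Node            *node;
    struct NodeCell *next;
} NodeCell;

typedef struct CaD {
    EventCell *tole, *frle, *tori, *frri;  /* four lists of Events */
    NodeCell  *ndle, *ndri;                /* two lists of Nodes */
} CaD;

/* Step 1: one zero-initialized CaD per breaking point. */
CaD **cad_root_new(size_t n)
{
    CaD **root = calloc(n, sizeof *root);
    if (!root)
        return NULL;
    for (size_t i = 0; i < n; i++) {
        root[i] = calloc(1, sizeof *root[i]);
        if (!root[i]) {               /* undo the partial allocation */
            while (i--)
                free(root[i]);
            free(root);
            return NULL;
        }
    }
    return root;
}

/* Prepends node to *list; returns 1 on success, 0 if allocation fails. */
static int add_node(NodeCell **list, Node *node)
{
    NodeCell *cell = malloc(sizeof *cell);
    if (!cell)
        return 0;
    cell->node = node;
    cell->next = *list;
    *list = cell;
    return 1;
}

/* Step 2: creates a node and registers it to the right of the CaD at fbp
 * and to the left of the CaD at lbp. Cleanup on failure is omitted. */
Node *node_new(CaD **root, int grsymbol, size_t fbp, size_t lbp)
{
    Node *node = calloc(1, sizeof *node);
    if (!node)
        return NULL;
    node->grsymbol = grsymbol;
    if (!add_node(&root[fbp]->ndri, node) || !add_node(&root[lbp]->ndle, node))
        return NULL;
    return node;
}
```

For the input a a a b, a caller would allocate five CaD structures with `cad_root_new(5)` and then call `node_new` once per pre-terminal symbol, with consecutive breaking points as fbp and lbp.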



4.4 Step 3: Event creation. 

For each node created at step 2, we generate its corresponding events using the
coverage tables. An Event has the following fields: grprod (production or grammar
rule), leftdot (left dot), rightdot (right dot), leftlinks (list of Link structures associated
with the left extreme), rightlinks (list of Link structures associated with the right
extreme) and status (logical status). Let us suppose that the node created (newNode)
has been associated with the grammar symbol $\alpha$. Then:

For each production $p \in LC1(\alpha)$ we create the appropriate new event (newEvent)
and:

4 Words that contain more than one lexical unit, such as clitics in Spanish or compounds in German. 
5 The first or left breaking point of the lexical unit. 
6 The last or right breaking point of the lexical unit. 



CaD_root[fbp]->frri = AddEvent(newEvent)
CaD_root[lbp]->frle = AddEvent(newEvent)

For each production $p \in LC2(\alpha)$ we create the appropriate new event (newEvent)
and:

CaD_root[fbp]->frri = AddEvent(newEvent)
CaD_root[lbp]->tole = AddEvent(newEvent)

For each production $p \in MC(\alpha)$ we create the appropriate new event (newEvent)
and:

CaD_root[fbp]->tori = AddEvent(newEvent)
CaD_root[lbp]->tole = AddEvent(newEvent)

For each production $p \in RC(\alpha)$ we create the appropriate new event (newEvent)
and:

CaD_root[fbp]->tori = AddEvent(newEvent)
CaD_root[lbp]->frle = AddEvent(newEvent)
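
The four cases above follow a single pattern: a closed extreme is registered in a "fr" field and an open extreme in a "to" field. The sketch below expresses that pattern, assuming an epsilon-free grammar and a dot representation in which leftdot and rightdot are 0-based positions around the covered span. The `len` field and the function names are assumptions added for illustration; they are not fields listed in the text.

```c
#include <assert.h>

typedef enum { F_TOLE, F_FRLE, F_TORI, F_FRRI } Field;

typedef struct {
    int grprod;    /* index of the production (grammar rule) */
    int leftdot;   /* number of rhs symbols to the left of the covered span */
    int rightdot;  /* index just past the covered span */
    int len;       /* length of the right-hand side (assumed helper field) */
} Event;

/* Creates the event for the rhs occurrence at index pos of production grprod. */
Event event_new(int grprod, int len, int pos)
{
    assert(len >= 1 && pos >= 0 && pos < len);
    Event e = { grprod, pos, pos + 1, len };
    return e;
}

/* Field of the CaD at fbp: closed left extreme -> frri, open -> tori. */
Field field_at_fbp(const Event *e)
{
    return e->leftdot == 0 ? F_FRRI : F_TORI;
}

/* Field of the CaD at lbp: closed right extreme -> frle, open -> tole. */
Field field_at_lbp(const Event *e)
{
    return e->rightdot == e->len ? F_FRLE : F_TOLE;
}
```

Under these assumptions, an event built from a production in LC2 has leftdot 0 and rightdot smaller than len, so it lands in frri at fbp and in tole at lbp, which matches the assignments listed for LC2.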

4.5 Step 4: Link creation. 

For each event created we have to analyze its possible links with other events. This
operation is internal to the CaD structure, according to the following criteria:

Figure 4. Links inside a CaD structure.



4.5.1 Analyses of Partial Derivations 

These analyses are applied to the open extremes of an event. Basically, we have to
check whether the required symbol is partially derivable from the real symbol.

We distinguish two types of derivation analysis depending on the direction
of the open extreme of the event.

Left-Derivation Analysis: TOLE - FRRI. Let us suppose that an event
$ev_{tole}$ is applying the production $\delta \to \delta_1 \ldots \delta_n$ over the surface $\Gamma \delta_h \ldots \delta_i \gamma_1 \ldots \gamma_j \Omega$,
where $1 \le h \le i < n$ and $1 \le j \le m$. In fact, this event is making a prediction of
the (required) symbol $\delta_{i+1}$ over the (real) symbol $\gamma_1$. The right extreme of this event
will be associated with the component tole of a CaD structure. In this case, we have
to check whether there exists a second event, $ev_{frri}$, whose left extreme belongs to the frri
field of the same CaD, such that $\delta_{i+1} \in LPD(\gamma)$, where $\gamma$ is the left-hand side of the
production associated with $ev_{frri}$: $\gamma \to \gamma_1 \ldots \gamma_m$.

Right-Derivation Analysis: TORI - FRLE. Let us suppose that an event
$ev_{tori}$ is applying the production $\delta \to \delta_1 \ldots \delta_n$ over the surface $\Gamma \gamma_j \ldots \gamma_m \delta_h \ldots \delta_i \Omega$,
where $1 < h \le i \le n$ and $1 \le j \le m$. In fact, this event is making a prediction of the
(required) symbol $\delta_{h-1}$ over the (real) symbol $\gamma_m$. The left extreme of this event will
be associated with the component tori of a CaD structure. In this case, we have to
check whether there exists a second event, $ev_{frle}$, whose right extreme belongs to the frle
field of the same CaD, such that $\delta_{h-1} \in RPD(\gamma)$, where $\gamma$ is the left-hand side of the
production associated with $ev_{frle}$: $\gamma \to \gamma_1 \ldots \gamma_m$.

4.5.2 Analyses of Adjacencies 

These analyses are applied to the closed extremes of an event. Basically, we have to
check the adjacency relation between the symbol that the event will generate (the
left-hand side of the production) and the symbol that appears next to the closed extreme
of the event.

We distinguish two types of adjacency analysis depending on the direction
of the closed extreme of the event.

Left-Adjacency Analysis: FRRI - FRLE. Let us suppose that an event
$ev_{frri}$ is applying the production $\delta \to \delta_1 \ldots \delta_n$ over the surface $\Gamma \gamma_j \ldots \gamma_m \delta_1 \ldots \delta_i \Omega$,
where $1 \le i \le n$ and $1 \le j \le m$. The left extreme of this event belongs to the frri
field of a CaD structure. Then we have to analyze whether there exists an $ev_{frle}$ event whose
right extreme belongs to the frle field of the same CaD such that $\gamma \in LA(\delta)$, where
$\gamma$ is the lhs of the production of $ev_{frle}$: $\gamma \to \gamma_1 \ldots \gamma_m$.

If $\Gamma \gamma_j \ldots \gamma_m$ is empty, that is, the CaD associated with the left extreme of $ev_{frri}$
is the first one, then we have to check whether $LM(\delta)$ holds.

Right-Adjacency Analysis: FRLE - FRRI. Let us suppose that an event
$ev_{frle}$ is applying the production $\delta \to \delta_1 \ldots \delta_n$ over the surface $\Gamma \delta_i \ldots \delta_n \gamma_1 \ldots \gamma_j \Delta$,
where $1 \le i \le n$ and $1 \le j \le m$. The right extreme of this event belongs to the frle
field of a CaD structure. Then we have to analyze whether there exists an $ev_{frri}$ event whose
left extreme belongs to the frri field of the same CaD such that $\gamma \in RA(\delta)$, where $\gamma$
is the lhs of the production of $ev_{frri}$: $\gamma \to \gamma_1 \ldots \gamma_m$.

If $\gamma_1 \ldots \gamma_j \Delta$ is empty, that is, the CaD associated with the right extreme of $ev_{frle}$
is the last one, then we have to check whether $RM(\delta)$ holds.

4.5.3 Analyses of Fusions 

Left-Fusion Analysis: TORI - TOLE. Let us suppose that an event $ev_{tori}$ is
applying the production $\delta \to \delta_1 \ldots \delta_n$ over the surface $\Gamma \gamma \delta_i \ldots \delta_j \Delta$, where $1 < i \le j \le n$.
The left extreme of this event belongs to the tori field of a CaD structure. Then
we have to analyze whether there exists an $ev_{tole}$ event in the tole field of the same CaD
such that $ev_{tole}$ is applying the same production as $ev_{tori}$ over the surface $\delta_h \ldots \delta_{i-1}$,
where $1 \le h$.

Right-Fusion Analysis: TOLE - TORI. Let us suppose that an event $ev_{tole}$
is applying the production $\delta \to \delta_1 \ldots \delta_n$ over the surface $\Gamma \delta_i \ldots \delta_j \gamma \Delta$, where $1 \le i \le j < n$.
The right extreme of this event belongs to the tole field of a CaD structure.
Then we have to analyze whether there exists an $ev_{tori}$ event in the tori field of the same
CaD such that $ev_{tori}$ is applying the same production as $ev_{tole}$ over the surface
$\delta_{j+1} \ldots \delta_k$, where $k \le n$.

Link creation. Each time an analysis is successful, we generate a Link structure
between the two events involved. For LM and RM analyses the Link will have
only one event.

4.6 Step 5: Event's Logical Status. 

Each time a Link is created we have to study the logical status of the events involved.
Also, at the end of the analysis of the links of an event (step 4) we analyze its
logical status.

Let $e$ be an event applying the production $\delta \to \delta_1 \ldots \delta_{i-1} \bullet \delta_i \ldots \delta_j \bullet \delta_{j+1} \ldots \delta_n$.

4.6.1 Closed-Closed events (FRRI + FRLE): i = 1 and j = n.

if ((!e->leftlinks) || (!e->rightlinks))
    nstatus = DELETE
else
    nstatus = RUN

4.6.2 Closed-Open events (FRRI + TORI): i = 1 and j < n.

if (!e->leftlinks)
    nstatus = DELETE
else if (e->rightlinks)
    nstatus = DERIVATION
else if (E(\delta_{j+1}))
    nstatus = EPSILON
else
    nstatus = DELETE

4.6.3 Open-Closed events (TOLE + FRLE): i > 1 and j = n.

if (!e->rightlinks)
    nstatus = DELETE
else if (e->leftlinks)
    nstatus = DERIVATION
else if (E(\delta_{i-1}))
    nstatus = EPSILON
else
    nstatus = DELETE

4.6.4 Open-Open events (TOLE + TORI): i > 1 and j < n.

if ((e->leftlinks) && (e->rightlinks))
    nstatus = DERIVATION
else if ((e->leftlinks) && (E(\delta_{j+1})))
    nstatus = EPSILON
else if ((e->rightlinks) && (E(\delta_{i-1})))
    nstatus = EPSILON
else
    nstatus = DELETE

If nstatus is different from e->status we change the logical status of the event.
To improve efficiency it is possible to maintain four lists of events (DERIVATION,
RUN, DELETE and EPSILON). Changing the status of an event then means moving the
event from one list to another, which may be done in constant time.
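
One way to obtain the constant-time move is to keep each status list doubly linked through the events themselves. The sketch below illustrates that idea and is not taken from the implementation: `StatusLists`, `event_init`, `set_status` and `get_event` are hypothetical names, and the extra `ST_NONE` list is an assumption used to hold events that currently have no pending status.

```c
#include <stddef.h>

typedef enum {
    ST_NONE, ST_DERIVATION, ST_RUN, ST_DELETE, ST_EPSILON, ST_COUNT
} Status;

typedef struct Event {
    Status        status;
    struct Event *prev, *next;   /* neighbours in the list of its status */
} Event;

typedef struct { Event *head[ST_COUNT]; } StatusLists;

/* Removes e from the list of its current status in O(1). */
static void unlink_event(StatusLists *sl, Event *e)
{
    if (e->prev)
        e->prev->next = e->next;
    else
        sl->head[e->status] = e->next;
    if (e->next)
        e->next->prev = e->prev;
    e->prev = e->next = NULL;
}

/* Inserts e at the head of the list for status s in O(1). */
static void push_event(StatusLists *sl, Event *e, Status s)
{
    e->status = s;
    e->prev = NULL;
    e->next = sl->head[s];
    if (e->next)
        e->next->prev = e;
    sl->head[s] = e;
}

/* Registers a freshly created event; every event is always in one list. */
void event_init(StatusLists *sl, Event *e)
{
    push_event(sl, e, ST_NONE);
}

/* Moves e to the list of nstatus only when the status actually changes. */
void set_status(StatusLists *sl, Event *e, Status nstatus)
{
    if (nstatus == e->status)
        return;
    unlink_event(sl, e);
    push_event(sl, e, nstatus);
}

/* Pops the first event with status s (NULL if none); it becomes ST_NONE. */
Event *get_event(StatusLists *sl, Status s)
{
    Event *e = sl->head[s];
    if (e)
        set_status(sl, e, ST_NONE);
    return e;
}
```

Both `set_status` and `get_event` touch only the event and its two neighbours, so their cost does not depend on the number of events in either list.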

4.7 Step 6: Parsing Cycle.

This is the kernel of the algorithm:

cycle = 1
while (cycle)
    cycle = 0
    if (event = GetEpsilonEvent())
        cycle = 1
        EpsilonExpansion(event)
    else if (event = GetDeleteEvent())
        cycle = 1
        DeleteEvent(event)
    else if (event = GetRunEvent())
        cycle = 1
        RunEvent(event)
    else if (link = GetFusionLink())
        cycle = 1
        FusionLink(link)
The Get* functions return the first element of the corresponding list and advance
the head of the list to the following element; both are constant-time operations.

6.1. - Epsilon Expansion. This operation moves the left dot one position to the
left or the right dot one position to the right, depending on the open extreme marked
as EPSILON.

6.2. - Delete Event. Deleting an event removes the event together with its links.

6.3. - Run Event. Running a closed-closed event involves the application of a grammar
rule, incorporating a new node (step 2). But if this node has previously been
created between the same CaD structures, we can obtain a representation model based
on subtree sharing and local ambiguity packing by associating the new analysis
with the previously created node. This way, a node has a list of
Analysis structures, and each Analysis is defined as a list of Node structures. The
result of this mechanism is a representation based on virtual relations between the
skeleton of the parse forest and the nodes included in it.
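
The packing step of Run Event can be sketched as a lookup followed by an append. The code below is illustrative only: it assumes that a node is identified by its grammar symbol and its two breaking points, and it stores all nodes in one flat list for brevity, whereas the text keeps them in the ndle and ndri lists of the CaD structures. The `fbp`, `lbp` and `next` fields, the `Forest` type and the function names are assumptions.

```c
#include <stdlib.h>

typedef struct Node Node;

typedef struct Analysis {
    Node           **children;   /* the Node structures of this analysis */
    size_t           nchildren;
    struct Analysis *next;
} Analysis;

struct Node {
    int       grsymbol;
    size_t    fbp, lbp;      /* delimiting breaking points (assumed fields) */
    Analysis *cmanalysis;    /* all analyses packed under this node */
    Node     *next;          /* flat list of nodes (assumption) */
};

typedef struct { Node *nodes; } Forest;

static Node *find_node(const Forest *f, int sym, size_t fbp, size_t lbp)
{
    for (Node *n = f->nodes; n; n = n->next)
        if (n->grsymbol == sym && n->fbp == fbp && n->lbp == lbp)
            return n;
    return NULL;
}

/* Adds one analysis for sym between fbp and lbp. Reuses the node if it
 * already exists (local ambiguity packing), otherwise creates it.
 * Returns the node, or NULL if allocation fails. */
Node *add_analysis(Forest *f, int sym, size_t fbp, size_t lbp,
                   Node **children, size_t nchildren)
{
    Analysis *a = malloc(sizeof *a);
    if (!a)
        return NULL;
    a->children = children;      /* children are shared, not copied */
    a->nchildren = nchildren;

    Node *n = find_node(f, sym, fbp, lbp);
    if (!n) {
        n = calloc(1, sizeof *n);
        if (!n) {
            free(a);
            return NULL;
        }
        n->grsymbol = sym;
        n->fbp = fbp;
        n->lbp = lbp;
        n->next = f->nodes;
        f->nodes = n;
    }
    a->next = n->cmanalysis;
    n->cmanalysis = a;
    return n;
}

/* Number of analyses packed under a node. */
size_t analysis_count(const Node *n)
{
    size_t count = 0;
    for (const Analysis *a = n->cmanalysis; a; a = a->next)
        count++;
    return count;
}
```

Because the children are referenced rather than copied, a subtree built once can appear in several analyses, which is the subtree sharing mentioned above.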

6.4. - Fusion Events. Let us consider the production $\delta \to \delta_1 \ldots \delta_h \ldots \delta_i \delta_{i+1} \ldots \delta_j \ldots \delta_n$
and two events $e_1 : \delta \to \delta_1 \ldots \bullet \delta_h \ldots \delta_i \bullet \delta_{i+1} \ldots \delta_n$ and $e_2 : \delta \to \delta_1 \ldots \delta_i \bullet \delta_{i+1} \ldots \delta_j \bullet \ldots \delta_n$.

If there exists a fusion link (exevlink) between $e_1$ (rightlinks) and $e_2$ (leftlinks), then in
this context the application of their fusion will generate the following actions:

Case 6.4.1: Fusion with Double Derivation:

if ((e1->rightlinks) && (e2->leftlinks))
    Create a new event $e_n : \delta \to \delta_1 \ldots \bullet \delta_h \ldots \delta_i \delta_{i+1} \ldots \delta_j \bullet \ldots \delta_n$

Case 6.4.2: Fusion with Single Right Derivation:

else if (e1->rightlinks)
    Modify $e_2 : \delta \to \delta_1 \ldots \bullet \delta_h \ldots \delta_i \delta_{i+1} \ldots \delta_j \bullet \ldots \delta_n$

Case 6.4.3: Fusion with Single Left Derivation:

else if (e2->leftlinks)
    Modify $e_1 : \delta \to \delta_1 \ldots \bullet \delta_h \ldots \delta_i \delta_{i+1} \ldots \delta_j \bullet \ldots \delta_n$

Case 6.4.4: Fusion without Derivation:

else
    Modify $e_1 : \delta \to \delta_1 \ldots \bullet \delta_h \ldots \delta_i \delta_{i+1} \ldots \delta_j \bullet \ldots \delta_n$
    Delete $e_2$



5 Implementation and Experimental Results 

This algorithm has been implemented in C, including a specific memory management
layer that improves on the classical malloc and free operations.

Our experimental results show that this algorithm fully eliminates parsing ambiguity
for recursive, local and non-local dependency constructions. For this kind of
phenomena, the experimental results show an observed complexity of order $O(n \log n)$,
where n is the length of the input string. ^7

Next, we show the fitted model obtained for each type of grammar. The dependent
variable T is the time used for the complete analysis (in seconds) and the factor
used, W, is the length of the input string (number of words).

We show the results obtained with a simple linear regression for two cases.
The first one uses T as the response and W log(W) as the factor. The second one uses
T/W as the response and the same factor W log(W). In addition, we have included
Pearson correlation coefficients (PCC) for both cases.

• Recursive Constructions:

T = -5.183 + 219E-7 * (W lg(W)); PCC(T, W lg(W)) = 0.999
T/W = 0.0001 + A6E-13 * (W lg(W)); PCC(T/W, W lg(W)) = 0.974

• Local Dependencies:

T = -17.82 + 352E-7 * (W lg(W)); PCC(T, W lg(W)) = 0.993
T/W = 0.0002 + 38E-12 * (W lg(W)); PCC(T/W, W lg(W)) = 0.998

• Non-local Dependencies:

T = -1.031 + 85E-8 * (W lg(W)); PCC(T, W lg(W)) = 0.998
T/W = 0.0002 + UE-12 * (W lg(W)); PCC(T/W, W lg(W)) = 0.998

7 [Quesada 1997] contains a full description of the algorithm as well as a more detailed analysis of the
experiments, including the grammars, strings of words and results.



6 Conclusion 

The problem of parsing natural languages must be studied from three perspectives:
computational, linguistic and formal. In this paper we have presented a general, sound
and efficient natural language parsing algorithm which meets the main requirements
of the three levels.

The computational layer includes a specific memory management model and a
strategy for grammar compilation. This module has been designed with the goal of
efficiency. The linguistic layer is in charge of general applicability, and basically includes
a mechanism for integrating the algorithm with unification grammars. Finally, at
the formal level, the proposed mathematical kernel permits the demonstration of the
correctness and soundness of the algorithm [Quesada 1997].

This paper has concentrated on the description of the algorithm itself, describing
the data model and the parsing strategy.